1. Introduction
This document contains the details of my re-analysis of the sample mixups in the Attie eQTL data, to accompany the manuscript describing the work.
I will suppress much of the R code from the html version of this document; for details, see the asciidoc source file.
Here are the versions of R and the packages that I’m using:
> R.version$version.string
[1] "R version 3.2.2 (2015-08-14)"
> library(qtl)
> qtlversion()
[1] "1.36-6"
> library(lineup)
> lineupversion()
[1] "0.37-6"
> library(broman)
> bromanversion()
[1] "0.59-5"
> library(ascii)
> paste(unlist(packageVersion("ascii")), collapse=".") # ascii version
[1] "2.1"
2. Preliminary data cleaning
2.1. Genotype data
There were originally 554 mice. There were 3 mice with failed genotypes. We omitted the genotypes for another 7 mice, as the genotype data were bad. (They showed a high rate of apparent genotyping errors, an unusually large proportion of homozygous gneotypes, and many apparent crossovers.) That leaves 544 mice with genotype data.
The genotype data were cleaned to remove markers not segregating in the cross, to correctly align the BTBR and B6 alleles, and to remove some badly behaved markers (data not shown). Remaining were 2060 markers, including 20 on the X chromosome.
2.1.1. Sex swaps
The cross performed was (BTBR × B6) × (BTBR × B6), with females listed first. Thus, females should be homozygous BTBR or heterozygous while males are hemizygous B6 or BTBR.
We identified 33 mice whose X chromosome genotypes didn’t match their sex: There were 19 females whose genotype data indicated them to be males (many homogyzous B6 calls and no hets) and 14 males whose genotype data indicated them to be females (many heterozygous calls).
There are 112 mice with BTBR homozygous calls for the entire X chromosome, which is compatible with both sexes.
2.1.2. Sample duplicates
We identified 6 pairs of clear sample duplicates in the genotype data.
Mouse1 |
Mouse2 |
# matches |
# typed |
% mismatches |
Mouse3259 |
Mouse3269 |
2017 |
2022 |
0.2 |
Mouse3267 |
Mouse3362 |
1933 |
1966 |
1.7 |
Mouse3287 |
Mouse3290 |
2012 |
2016 |
0.2 |
Mouse3317 |
Mouse3318 |
1964 |
1996 |
1.6 |
Mouse3353 |
Mouse3354 |
2026 |
2031 |
0.2 |
Mouse3553 |
Mouse3559 |
1998 |
2008 |
0.5 |
One pair, Mouse3267 and Mouse3362, were actually omitted from the genotype data as being badly behaved.
We omit one individual from each pair from the genotype data.
2.2. Gene expression data
We have gene expression microarray data on six tissues: adipose, gastroc, hypo, islet, kidney, and liver. These were two-color Agilent arrays, with the probes being 60-mers. For each array, one color was a tissue-specific pool and the other was an individual sample. The expression levels were quantified as the “mlratio”, which is approximately the log10 ratio of the individual to the pool, though with some corrections that I don’t fully understand, and with values always between -2 and +2.
There were approximately 500 expression arrays performed for each tissue. A number of outlying arrays were omitted from each group. For hypo, there was an additional batch of 119 poorly behaved arrays that were omitted in later analyses, but they are included here as they contain a number of detectable sample mixups.
# arrays |
# omitted |
# kept |
|
adipose |
497 |
4 |
493 |
gastroc |
498 |
2 |
496 |
hypo |
494 |
1 |
493 |
islet |
499 |
1 |
498 |
kidney |
482 |
1 |
481 |
liver |
491 |
1 |
490 |
3. Lining up expression arrays
In this section, we line up the expression arrays.
The first step is to identify probes that are correlated between tissues. (For each array, there is a total of 40572 probes.) For each pair of arrays, we find the samples that are in common and calculate the correlation between tissues for each probe.
The following figure contains density estimates of the correlation distributions for the 15 pairs of tissues.
Here are counts, for each tissue pair, of probes with large between-tissue correlation.
Tissue1 |
Tissue2 |
corr > 0.7 |
corr > 0.75 |
corr > 0.8 |
corr > 0.9 |
adipose |
gastroc |
199 |
143 |
99 |
30 |
adipose |
hypo |
110 |
72 |
50 |
7 |
adipose |
islet |
216 |
159 |
106 |
38 |
adipose |
kidney |
255 |
186 |
135 |
51 |
adipose |
liver |
159 |
113 |
79 |
19 |
gastroc |
hypo |
79 |
55 |
43 |
10 |
gastroc |
islet |
180 |
132 |
92 |
33 |
gastroc |
kidney |
219 |
164 |
109 |
43 |
gastroc |
liver |
149 |
102 |
71 |
23 |
hypo |
islet |
127 |
82 |
57 |
10 |
hypo |
kidney |
131 |
92 |
60 |
17 |
hypo |
liver |
63 |
46 |
33 |
6 |
islet |
kidney |
269 |
200 |
146 |
42 |
islet |
liver |
152 |
97 |
64 |
24 |
kidney |
liver |
245 |
155 |
106 |
30 |
For each pair of tissues, we select the probes with between-tissue correlation > 0.75 and then calculate the between-individual correlations (the correlation between an individual in one tissue and an individual in the second tissue), using that subset of probes.
The selection of probes with high cross-tissue correlation is intended to pull out probes with strong signal. The between-individual correlations indicates the appropriate alignment of samples.
Then, for each tissue, we summarize the five correlation matrices, comparing that tissue to each other tissue, by taking, for each pair of individuals, the median of the five tissue-tissue correlations. This relies on the assumption that different samples are mixed up in the different tissues.
3.1. Adipose
There were 493 individuals assayed for adipose expression, after omitting 4 bad arrays. The analysis described above produces a 493 × 527 of median correlations, with the rows corresponding to the adipose samples.
The following figure contains histograms of the self-self correlations and self-non self correlations (or really medians of correlations). The self-self correlations are mostly large, but there are 5 samples with median self-self correlations < 0.5. The two modes in the self-nonself correlations correspond to mixed-sex and same-sex pairs.
For each row, we pull out the maximum correlation (or really the maximum of the median correlations); in the following, we plot the self-self correlations against these maxima. For most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 5 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.
For the 5 problem samples, the self-self correlation is low but in each case there is another sample for which correlation is high. We infer that Mouse3583 and Mouse3584 were swapped, and then there is a rotated trio, Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187.
Let’s look at detailed scatterplots of the expression data for the sample swaps. We pull out the probes that show correlation > 0.6 between adipose and at least one other tissue, and then make a scatter plot of a sample from one tissue against a sample from the second tissue. Points correspond to probes, with the sizes of the points indicating the probes' correlations, across individuals, for that pair of tissues.
First, Mouse3583 and Mouse3584. You can see that, in all tissues, Mouse3583 is not correlated with itself but is correlated with Mouse3584, and vice versa.
Now the rotated trio, Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187. Note that Mouse3188 was swapped with Mouse3179 in hypo (see below). Mouse3187 in hypo seems correct but poorly behaved. Mouse3187 and Mouse3188 were not assayed for kidney expression.
3.2. Gastroc
There were 496 individuals assayed for gastroc expression, after omitting 2 bad arrays.
We jump to the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.
There is an additional potential problem highlighted in orange, having low self-self correlation, and being slightly more correlated to another sample than to itself. But this is Mouse3188, which was involved in sample mix-ups in both adipose and hypo. We believe it’s correctly labeled in gastroc.
The 2 problem samples are inferred to be a sample swap, Mouse3655 ↔ Mouse3659.
Let’s look at detailed scatterplots of the expression data for the sample swap, Mouse3655 and Mouse3659. You can see that, in all tissues, Mouse3655 is not correlated with itself but is correlated with Mouse3659, and vice versa.
3.3. Hypo
There were 493 individuals assayed for hypo expression, after omitting 1 bad array.
Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 17 samples with problems are highlighted in green, but there are too many to attach labels.
Here are the summary statistics for the problems.
maximum corr |
self corr |
Inferred label |
|
Mouse3592 |
0.94 |
0.57 |
Mouse3594 |
Mouse3594 |
0.94 |
0.57 |
Mouse3592 |
Mouse3369 |
0.94 |
0.28 |
Mouse3367 |
Mouse3367 |
0.86 |
0.41 |
Mouse3369 |
Mouse3590 |
0.94 |
0.72 |
Mouse3589 |
Mouse3589 |
0.93 |
0.73 |
Mouse3590 |
Mouse3454 |
0.93 |
0.46 |
Mouse3452 |
Mouse3452 |
0.89 |
0.42 |
Mouse3454 |
Mouse3451 |
0.93 |
-0.22 |
Mouse3449 |
Mouse3449 |
0.91 |
-0.22 |
Mouse3451 |
Mouse3382 |
0.85 |
0.54 |
Mouse3381 |
Mouse3347 |
0.84 |
0.16 |
Mouse3348 |
Mouse3348 |
0.75 |
0.09 |
Mouse3347 |
Mouse3179 |
0.75 |
0.15 |
Mouse3188 |
Mouse3188 |
0.74 |
0.27 |
Mouse3179 |
Mouse3210 |
0.72 |
0.25 |
Mouse3208 |
Mouse3208 |
0.70 |
0.23 |
Mouse3210 |
Mouse3589 and Mouse3590 have reasonably high self-self correlation (0.7) but are each much more highly correlated to the other (0.9).
We infer a set of nine sample swaps:
Mouse3179 ↔ Mouse3188 |
Mouse3208 ↔ Mouse3210 |
Mouse3347 ↔ Mouse3348 |
Mouse3367 ↔ Mouse3369 |
Mouse3381 ↔ Mouse3382 |
Mouse3449 ↔ Mouse3451 |
Mouse3452 ↔ Mouse3454 |
Mouse3589 ↔ Mouse3590 |
Mouse3592 ↔ Mouse3594 |
There is one wrinkle here: Mouse3381 was not assayed for hypo expression, but really it was Mouse3382 that was not assayed; Mouse3381 was assayed but was labeled Mouse3382.
Let’s look at detailed scatterplots of the expression data for all of these sample swaps.
First, Mouse3179 ↔ Mouse3188. Recall that Mouse3188 was part of a rotated trio in adipose (Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187). These are not too pretty, but it is a clear swap Mouse3179 ↔ Mouse3188 in hypo.
Mouse3208 ↔ Mouse3210: These are rather messy, but it is still a clear sample swap.
Mouse3347 ↔ Mouse3348: Again, these are rather messy, but still a clear sample swap.
Mouse3367 ↔ Mouse3369
Mouse3381 ↔ Mouse3382 (but note that Mouse3381 was not intended to be done in hypo).
Mouse3449 ↔ Mouse3451
Mouse3452 ↔ Mouse3454
Mouse3589 ↔ Mouse3590
Mouse3592 ↔ Mouse3594
3.4. Islet
There were 498 individuals assayed for islet expression, after omitting 1 bad array.
Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 3 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.
There is an additional potential problem highlighted in orange, having low self-self correlation, and being slightly more correlated to another sample than to itself, but this is Mouse3188 again, which was involved in sample mix-ups in both adipose and hypo. We think it is correctly labeled in islet.
The 3 problem samples are inferred to be a sample swap, Mouse3598 ↔ Mouse3599, plus a sample duplicate: Mouse3295 was run correctly but also with the label Mouse3296 (more on this below).
Let’s look at detailed scatterplots of the expression data for the sample mixups.
First, the sample swap, Mouse3598 ↔ Mouse3599. You can see that, in all tissues, Mouse3598 is not correlated with itself but is correlated with Mouse3599, and vice versa.
Now we turn to the apparent duplicate, the Mouse3296 sample really being Mouse3295. Mouse3296 in islet is not correlated with itself but is correlated with Mouse3295. Mouse3295 in islet looks fine: It’s correlated with itself and not correlated Mouse3296 in the other tissues. We’ll revisit this below in the section on Sample duplicates.
3.5. Kidney
There were 481 individuals assayed for kidney expression, after omitting 1 bad array. Note that there were 27 samples that were assayed for kidney expression but no other tissue.
Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.
There is an additional potential problem highlighted in orange (Mouse3484), being a bit more correlated to another sample (Mouse3503) than to itself. These two arrays are oddly behaved and will be investigated further below.
The 2 problem samples are inferred to be a sample swap, Mouse3510 ↔ Mouse3523.
Let’s look at detailed scatterplots of the expression data for the sample mixups.
First, the sample swap, Mouse3510 ↔ Mouse3523. You can see that, in all tissues, Mouse3510 is not correlated with itself but is correlated with Mouse3523, and vice versa.
Now we show scatterplots for the odd pair, Mouse3484 and 3503. They all seem correlated with each other. We’ll revisit this below in the section on Sample duplicates.
3.6. Liver
There were 490 individuals assayed for liver expression, after omitting 1 bad array.
Here is the plot of the self-self correlations versus the maximum correlation. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.
The 2 problem samples are inferred to be a sample swap, Mouse3142 ↔ Mouse3143 (though Mouse3143 was not intended to be assayed for liver), and a sample duplicate: Mouse3136 was assayed correctly and also with the label Mouse3141 (more below).
Let’s look at detailed scatterplots of the expression data for the sample mixups.
First, the sample swap, Mouse3142 ↔ Mouse3143, but with Mouse3143 not intended to be assayed in liver. You can see that, in all tissues, Mouse3142 is not correlated with itself but is correlated with Mouse3143.
Now we turn to the apparent duplicate, the Mouse3141 sample really being Mouse3136. We’ll revisit this in the next section, on Sample duplicates.
3.7. Sample duplicates
Above, we had identified several apparent sample duplicates (which might be called “unintended technical replicates”). We now look at these in more detail.
Consideration of all probes on an array will not be informative, as most genes are not expressed in a tissue and so the measurements are just noise. To identify probes that with signal, we pull out those probes with correlation > 0.6 against at least one other tissue.
For each tissue, there are non-duplicate pairs with correlations that reach 0.9, while each of islet, kidney and liver have a pair of samples with correlation > 0.98.
Here are the top 5 between-tissue correlations for each tissue.
adipose |
gastroc |
hypo |
islet |
kidney |
liver |
0.86 |
0.90 |
0.89 |
0.98 |
0.99 |
0.98 |
0.85 |
0.88 |
0.88 |
0.86 |
0.87 |
0.89 |
0.85 |
0.88 |
0.87 |
0.86 |
0.87 |
0.87 |
0.85 |
0.88 |
0.87 |
0.86 |
0.86 |
0.87 |
0.84 |
0.87 |
0.87 |
0.84 |
0.86 |
0.87 |
In islet, Mouse3296 is a duplicate of Mouse3295.
In liver, Mouse3141 is a duplicate of Mouse3136.
In kidney, Mouse3484 and Mouse3503 remain a bit of a conundrum, since they each are correlated with the other in all tissues.
This feature is specific to the two kidney samples. For example, consider the between-individual correlations for all 15 tissue pairs. Here, we’ll grab the subset of 76 probes with correlation > 0.6 for all tissue pairs.
Here are plots of the 15 between-tissue correlations for these two mice. The pairs involving kidney are in red.
I still don’t know what to make of this. It’s interesting to note that the red points are at about the same height in all four figures, while the blue points are high for 3484:3484 and 3503:3503 but low for 3484:3503 and 3503:3484.
Why should this pair of kidney samples have such similar expression? I’m inclined to think that a mixture of the two RNAs were placed on both arrays.
I will omit these two arrays from further analyses.
3.8. Summary
Here’s a summary of the arrays omitted and found do be in mislabeled, for each tissue.
total |
omit |
okay |
error |
|
adipose |
497 |
4 |
488 |
5 |
gastroc |
498 |
2 |
494 |
2 |
hypo |
494 |
1 |
476 |
17 |
islet |
499 |
1 |
495 |
3 |
kidney |
482 |
3 |
477 |
2 |
liver |
491 |
1 |
488 |
2 |
I now omit the relevant arrays, combine the duplicates in liver and islet (by simply taking the average of the mlratios for each probe) and then re-name the arrays as inferred.
4. Line up expression with genotypes
We now turn to the alignment of the expression arrays with the genotype data.
We first pull out all probes that have an annotated genomic position to an autosome. There are 36364 such probes.
We next identify the nearest marker or pseudomarker for each probe. QTL genotype probabilities were calculated at the markers and at 0.25 cM steps along the genome.
We then calculate a local LOD score for each probe in each tissue, at the marker/pseudomarker closest to the probe.
Here is a table of the numbers of probes in each tissue with local LOD > 100.
adipose |
gastroc |
hypo |
islet |
kidney |
liver |
103 |
88 |
48 |
135 |
88 |
54 |
We grab the probes with local LOD score > 100, construct a k-nearest neighbor classifier, for classifying eQTL genotype from expression, for each eQTL, and finally calculate the proportion mismatches between predicted and observed eQTL genotypes. (We use k=20 and a minimum class probability of 0.8 to make a decision.)
Our measure of distance between samples is the proportion of mismatches between the observed and predicted eQTL genotypes. Consider the distance matrix with DNA samples as rows and mRNA samples as columns. It’s particularly informative to make a scatterplot of the minimum value in a row against the self-self distance. These plots are analogous to the plots of maximum correlation vs. self-self correlation that we had considered in aligning the expression arrays.
In the following, we display such a plot for each tissue considered separately, plus with the distances matrices combined (by simply taking the overall proportion of mismatches, across all tissues).
In each case, the majority of samples are correctly labeled; for these, with the results for all tissues combined, the self-self distance is the minimal distance (the points in blue). There are 435 such samples.
The green points correspond to samples that were incorrectly labeled, but that are fixable: the self-self distance is large, indicating a clear error, but the minimal distance is small; there is an mRNA sample to which the DNA sample appears to correpond. With the results combined across tissues, there are 72 such samples.
The red points correspond to samples that were incorrectly labeled and not fixable: the self-self distance is large, indicating a clear error, and the minimal distance is also large; there is no mRNA sample to which the DNA sample appears to correspond. With the results combined across tissues, there are 12 such samples.
There are an additional 20 DNA samples for which expression assays were not intended to be done. These are not represented in the above figure, as their self-self distances are missing. However, we can calculate the minimum distance between these DNA samples and all mRNA samples. There are 12 such DNA samples that have < 10% mismatches with one of the mRNA sample; these likely are fixable sample mixups. The other 8 DNA samples have no mRNA sample that seems to correspond and so may be correct (though we cannot be certain).
4.1. Potential discrepancies
Note that the points are colored based on the results for the six tissues combined. For each tissue there are numerous green points (with the combined data, viewed as “fixable”) that have high minimum mismatches. These are cases where the inferred sample was not run in that particular tissue but was run in others: For the most part, these are samples that were run only in kidney or run in other tissues but not kidney. Hypo is a bit messy, though, with less-clear separation between “fixable” and “not fixable”.
Here is a table of these discrepancies. The numbers are the minimal percent mismatches between observed and inferred eQTL genotypes, by tissue and then with all tissues combined. The “Note” column explains why some tissues show a discrepancy; most are due to missing expression data.
Sample |
Inferred |
Note |
adipose |
gastroc |
hypo |
islet |
kidney |
liver |
combined |
3329 |
3328 |
all but adipose |
38 |
1 |
3 |
0 |
0 |
0 |
1 |
3382 |
3381 |
all but kidney |
1 |
2 |
16 |
1 |
38 |
5 |
3 |
3369 |
3367 |
all but kidney |
0 |
5 |
0 |
1 |
33 |
0 |
1 |
3367 |
3365 |
all but kidney |
1 |
4 |
3 |
3 |
35 |
2 |
3 |
3334 |
3333 |
all but kidney |
0 |
1 |
0 |
0 |
35 |
0 |
0 |
3385 |
3384 |
all but kidney |
4 |
1 |
12 |
2 |
42 |
2 |
3 |
3363 |
3361 |
all but kidney |
1 |
4 |
3 |
1 |
36 |
0 |
2 |
3356 |
3354 |
all but kidney |
2 |
4 |
3 |
2 |
30 |
0 |
2 |
3368 |
3366 |
all but kidney |
7 |
0 |
9 |
1 |
34 |
0 |
3 |
3352 |
3351 |
all but kidney |
2 |
1 |
6 |
3 |
25 |
0 |
2 |
3355 |
3353 |
all but kidney |
5 |
6 |
0 |
5 |
42 |
3 |
4 |
3567 |
3559 |
all but kidney |
0 |
3 |
0 |
1 |
31 |
2 |
1 |
3406 |
3405 |
all but liver |
0 |
1 |
0 |
1 |
0 |
32 |
1 |
3320 |
3319 |
many sloppy |
11 |
3 |
10 |
10 |
16 |
6 |
10 |
3326 |
3325 |
missing gastroc and kidney |
4 |
36 |
14 |
3 |
31 |
0 |
4 |
3344 |
3343 |
missing hypo and kidney |
5 |
0 |
31 |
1 |
37 |
4 |
2 |
3395 |
3394 |
only kidney |
35 |
36 |
35 |
34 |
4 |
35 |
4 |
3397 |
3396 |
only kidney |
32 |
39 |
24 |
32 |
4 |
27 |
4 |
3339 |
3338 |
only kidney |
37 |
37 |
29 |
38 |
0 |
38 |
0 |
3408 |
3407 |
only kidney |
28 |
24 |
26 |
30 |
1 |
32 |
1 |
3407 |
3406 |
only kidney |
45 |
40 |
35 |
47 |
1 |
41 |
1 |
3399 |
3398 |
only kidney |
42 |
35 |
35 |
36 |
0 |
33 |
0 |
3400 |
3399 |
only kidney |
31 |
36 |
30 |
32 |
1 |
35 |
1 |
3393 |
3390 |
only kidney |
34 |
36 |
27 |
35 |
6 |
29 |
6 |
3396 |
3395 |
only kidney |
39 |
41 |
33 |
42 |
3 |
41 |
3 |
3398 |
3397 |
only kidney |
32 |
33 |
32 |
40 |
1 |
28 |
1 |
3378 |
3377 |
only kidney |
35 |
33 |
20 |
36 |
0 |
33 |
0 |
3390 |
3389 |
only kidney |
33 |
24 |
38 |
27 |
0 |
27 |
0 |
3401 |
3400 |
only kidney |
40 |
43 |
37 |
41 |
1 |
31 |
1 |
3405 |
3404 |
only kidney |
31 |
24 |
23 |
26 |
1 |
24 |
1 |
3359 |
3357 |
sloppy hypo |
3 |
8 |
33 |
0 |
0 |
5 |
5 |
3366 |
3364 |
sloppy hypo |
1 |
1 |
29 |
2 |
0 |
0 |
3 |
3374 |
3373 |
sloppy hypo |
4 |
4 |
18 |
2 |
3 |
0 |
4 |
3351 |
3350 |
sloppy hypo |
1 |
3 |
17 |
1 |
1 |
5 |
3 |
3381 |
3378 |
sloppy hypo |
4 |
1 |
32 |
3 |
1 |
8 |
6 |
3370 |
3368 |
sloppy hypo |
2 |
4 |
16 |
2 |
7 |
7 |
5 |
3365 |
3363 |
sloppy hypo |
3 |
2 |
18 |
2 |
0 |
2 |
3 |
3260 |
3286 |
sloppy liver |
4 |
3 |
3 |
7 |
3 |
21 |
6 |
There are two additional samples that deserve to be highlighted, which are two blue points (inferred to be correctly labeled) that are off the diagonal line in one tissue:
-
Mouse3270 shows 39% mismatches (18/46) in liver, and 8% mismatches (34/403) overall.
-
Mouse3142 shows 26% mismatches (9/34) in hypo, and 4% mismatches (14/372) overall.
In both of these cases, I’m inclined to conclude that the expression array data may be of low quality, but the samples are correctly labeled.
4.2. Fix up sample duplicates
The re-alignment of the DNA samples requires that we go back to the inferred sample duplicates and relabel some of them. In particular, the sample for Mouse3269 was found to be a duplicate of the sample for Mouse3259, but the sample for Mouse3259 was actually for Mouse3332, and so both are really Mouse3332. Similarly, the sample for Mouse3354 was a duplicate of the sample for Mouse3353, but the sample for Mouse3353 was actually Mouse3352, and so both are really Mouse3352.
4.3. Summary
Here’s a summary of the problems observed in the genotype data.
okay |
435 |
omitted |
10 |
duplicate |
5 |
maybe okay |
8 |
error |
96 |
total |
554 |
5. X chromosome and sex
5.1. Genotypes
I originally came to these observations by identifying a set of samples for which sex did not match the X chromosome genotype. It would be good to revisit that issue: with the genotypes realigned, do the sexes now match what was observed in the X chromosome data?
We first grab the set of DNAs for which we are now sure of the appropriate labels. (There are 519 such.)
We want to compare the inferred sex from the X chromosome genotypes for the samples to the documented sex (attached to the correct label).
Of the 519 DNA samples that are correct or can be corrected, 23 showed a mismatch between sex and X chromosome genotypes. With the corrected sample labels, there are 3 discrepancies:
Original ID |
Corrected ID |
Sex from genotypes |
Sex from necropsy |
Mouse3368 |
Mouse3366 |
Male |
Female |
Mouse3393 |
Mouse3390 |
Male |
Female |
Mouse3399 |
Mouse3398 |
Female |
Male |
But all three of these mice are homozygous (or hemizygous) BTBR across the entire X chromosome, and so the X chromosome genotypes are compatible with either sex. Thus, there are no real discrepancies between the X chromosome genotypes and the sex of the mice, once the labels for the DNA samples have been corrected.
RR |
BR |
BY |
RY |
|
Mouse3368 |
0 |
0 |
0 |
20 |
Mouse3393 |
0 |
0 |
0 |
20 |
Mouse3399 |
19 |
0 |
0 |
0 |
5.2. Gene expression
We can also look at the expression of the Xist gene (which should be high in females and 0 in males) and of Y chromosome genes (which should be >0 in males and 0 in females).
In the following, I plot the average expression of 3 Y chromosome genes against the expression of Xist, with points colored by the sex of the sample (blue is male; red is female). (The expression arrays contained 21 probes for genes on the Y chromosome, but only a portion of them showed clear sex differences in expression in all tissues, and so I focused on these.)
Hypo is a bit of a mess, with a bunch of females having unusually low Xist expression and a bunch of males having unusually low Y chromosome expression; this is almost entirely explained by a bad batch of arrays that we ultimately removed.
The other five tissues look great. There is one case of a male with relatively high Xist expression in adipose, but the sex assignments of the expression arrays, after the fixing labels, appears to be correct.
The following is the plot for hypo, without the bad batch of arrays:
6. QTL mapping results, before and after
I now will compare the QTL mapping results, before and after the correction of the sample mixups. I will look at insulin (a particularly important trait), agouti coat color and tufted coat (both simple Mendelian phenotypes), and eQTL for all tissues.
I first need to re-align the genotype data. I will drop all samples whose labels are not clear (that includes any DNA without associated expression data).
One problem with the data in the original form are the 28 mice whose X chromosome genotypes don’t match their sex. For these, I will use the correct sex and omit the X chromosome genotype data.
6.1. Insulin
First, I’ll look at insulin levels at 10 weeks; we actually have three different measures of insulin at 10 weeks, and we also have insulin at 4, 6 and 8 weeks, but I’ll just look at one of the 10 week measures. I’ll use the log scale and will include sex as an interactive covariate (that is, allowing QTL to have different effect in the two sexes).
In the original data, there were 539 mice; in the corrected data, we are left with 519 mice. The following are the LOD curves, with those for the original data in red and those for the corrected data in blue.
With the original data, there were 3 chromosomes having a LOD score ≥ 5; with the corrected data, there are 7 chromosomes with LOD ≥ 5.
Here is a summary table of the peak LOD on chromosomes with inferred QTL.
Chr |
LOD (old) |
LOD (new) |
Pos’n (old) |
Pos’n (new) |
Increase in LOD |
2 |
7.05 |
9.21 |
70.6 |
71.0 |
2.16 |
5 |
3.00 |
5.08 |
56.2 |
54.7 |
2.09 |
6 |
3.56 |
5.30 |
43.6 |
43.1 |
1.74 |
7 |
4.73 |
6.31 |
87.4 |
86.0 |
1.58 |
9 |
3.31 |
6.72 |
49.0 |
46.2 |
3.41 |
12 |
5.58 |
5.83 |
10.5 |
10.0 |
0.25 |
19 |
6.59 |
6.79 |
43.8 |
44.3 |
0.20 |
The LOD scores on chromosomes 2, 5, 6, 7, and 9 went up substantially. The peak locations are largely unchanged.
6.2. Agouti coat
For most mice, it was recorded whether they had an agouti coat or not. This is due to a single gene (a, non-agouti) on chromosome 2 at 154.6-154.8 Mbp.
In both the original and corrected data, agouti coat color maps strongly to chromosome 2. The LOD score increased from 64 to 110 with the correction of the sample mixups. In both cases, the peak is at 154.4 Mbp, right next to the true gene.
If we grab the mice with good genotype data at the agouti locus, we obtain the following relationship between agouti genotype and phenotype for the original data (with alleles B = B6 and R = BTBR):
Agouti |
|||
Tan |
Black |
||
Chr 2 genotype |
BB |
26 |
114 |
BR |
249 |
15 |
|
RR |
88 |
6 |
|
With the corrected data, we get the following table.
Agouti |
|||
Tan |
Black |
||
Chr 2 genotype |
BB |
5 |
126 |
BR |
255 |
2 |
|
RR |
92 |
0 |
|
With the original data, there are 47 mismatches out of 498 mice. With the corrected data, there are 7 mismatches out of 480 mice.
So there are still 7 mismatches remaining, but the results are considerably improved.
Here are the actual mice with mismatching agouti genotype/phenotype:
pheno |
geno |
|
Mouse3159 |
Tan |
BB |
Mouse3179 |
Black |
BR |
Mouse3180 |
Tan |
BB |
Mouse3225 |
Tan |
BB |
Mouse3394 |
Tan |
BB |
Mouse3507 |
Tan |
BB |
Mouse3586 |
Black |
BR |
The genotypes for the mismatching individuals seem clear and are not likely genotyping errors. Here are the genotypes on chromosome 2 for these seven mice, with the agouti locus indicated by a green vertical line. The top five have tan coats and should be BR (gray points) or RR (black points) at the agouti locus, but are all BB (white points). The bottom two have black coats and should be BB (white points) at the agouti locus but are BR (gray points).
6.3. Tufted coat
For most mice, we also have information on whether they had a tufted coat or not. This is due to a single gene on chromosome 17, though it seems that the actual gene is not yet known.
In both the original and corrected data, tufted coat maps strongly to chromosome 17, though the LOD score increased from 64 to 107 with the correction of the sample mixups. In both cases, the peak is at marker rs3700924, at 27.2 Mbp.
If we grab the mice with good genotype data at the tufted locus, we obtain the following relationship between tufted genotype and phenotype for the original data (with alleles B = B6 and R = BTBR):
Tufted coat |
|||
No |
Yes |
||
Chr 17 genotype |
BB |
151 |
7 |
BR |
258 |
9 |
|
RR |
21 |
92 |
|
With the corrected data, we get the following table.
Tufted coat |
|||
No |
Yes |
||
Chr 17 genotype |
BB |
153 |
0 |
BR |
256 |
0 |
|
RR |
4 |
106 |
|
With the original data, there are 37 mismatches out of 538 mice. With the corrected data, there are 4 mismatches out of 519 mice.
So there are still 4 mismatches remaining, but, as with agouti coat, the results are considerably improved after the correction for the sample mixups.
Here are the actual mice with mismatching tufted genotype/phenotype:
pheno |
geno |
|
Mouse3152 |
Not tufted |
RR |
Mouse3154 |
Not tufted |
RR |
Mouse3275 |
Not tufted |
RR |
Mouse3603 |
Not tufted |
RR |
The genotypes for the mismatching individuals seem clear and are not likely genotyping errors. Here are the genotypes on chromosome 17 for these four mice, with the tufted locus indicated by a green vertical line. These all have non-tufted coats and should be BB (white gray) or BR (gray points) but are all RR (black points) at the tufted locus.
6.4. eQTL analysis
Finally, we will map the expression of genes in all six tissues, with both the original data and the corrected data. We will focus on probes for which we were able to infer a genomic location, and on an autosome or the X chromosome (i.e., not the Y or mitochondria). There are 37797 such probes.
For all probes, we transform the expression measures to normal quantiles, and we include sex as an interactive covariate in the QTL analyses.
For the hypo expression data, we omitted a batch of 119 arrays that were found to be bad. For each transcript in each tissue, we used Haley-Knott regression to calculate LOD curves across the chromosome. For each transcript, we find the maximum LOD score on each chromosome and count the number of chromosomes that have LOD ≥ 5. A transcript is said to have a local-eQTL if the LOD score on its chromosome is ≥ 5 and the genomic location of the transcript is within the 2-LOD support interval for that eQTL; otherwise, a peak above 5 on that chromosome is considered a trans-eQTL.
Here is a figure with the numbers of local- and trans-eQTL for each tissue, with the original data and with the corrected data.
Across all tissues, there were 16946 local-eQTL and 69332 trans-eQTL with the original data. After correcting the sample mixups, these numbers increased to 18183 and 95152 for local-eQTL and trans-eQTL, respectively. These are increases of 7% and 37%, respectively.
6.5. eQTL analysis, LOD > 10
In response to a question from a reviewer, let’s repeat this, looking for eQTL with LOD ≥ 10.
Across all tissues, there were 9527 local-eQTL and 12262 trans-eQTL with the original data. After correcting the sample mixups, these numbers increased to 10339 and 17470 for local-eQTL and trans-eQTL, respectively. These are increases of 9% and 42%, respectively.
7. Summary
We have identified strong support for a large number of DNA sample mixups in this project; we further identified a much smaller number of sample mixups within each set of RNA samples. Fortunately, we are able to correct the majority of these problems and recover from the mistakes.
As should be expected, correcting these sample mixups leads to great improvement in the QTL mapping results, both for clinical phenotypes, such as insulin level, and in the expression traits.